PREFACE

This report draws attention to the frequent, but often neglected, need to
force a regression line through a known point while obtaining the best
possible fit to all experimental data points. A simple method is described
for solving this problem without modifying customary computational routines.
This method can be applied to many problems, but is especially useful when
calibrating empirical prediction formulas to fit site-specific coastal
conditions or when choosing from among several theoretical prediction models.
The work was carried out under the U.S. Army Coastal Engineering Research
Center's (CERC) Shore Response to Offshore Dredging work unit, Shore
Protection and Restoration Program, Coastal Engineering Area of Civil Works
Research and Development.
FIGURES

1  Application of Model I produces an intercept (α), which may be a useful
   estimate of a component of longshore flow which is independent of wave
   conditions and presumably pervades the entire data set
CONVERSION FACTORS, U.S. CUSTOMARY TO METRIC (SI) UNITS OF MEASUREMENT

U.S. customary units of measurement used in this report can be converted to
metric (SI) units as follows:

Multiply              by                To obtain

inches                25.4              millimeters
                      2.54              centimeters
square inches         6.452             square centimeters
cubic inches          16.39             cubic centimeters
feet                  30.48             centimeters
                      0.3048            meters
square feet           0.0929            square meters
cubic feet            0.0283            cubic meters
yards                 0.9144            meters
square yards          0.836             square meters
cubic yards           0.7646            cubic meters
miles                 1.6093            kilometers
square miles          259.0             hectares
knots                 1.852             kilometers per hour
acres                 0.4047            hectares
foot-pounds           1.3558            newton meters
millibars             1.0197 x 10^-3    kilograms per square centimeter
ounces                28.35             grams
pounds                453.6             grams
                      0.4536            kilograms
ton, long             1.0160            metric tons
ton, short            0.9072            metric tons
degrees (angle)       0.01745           radians
Fahrenheit degrees    5/9               Celsius degrees or Kelvins (1)

(1) To obtain Celsius (C) temperature readings from Fahrenheit (F) readings,
    use formula: C = (5/9)(F - 32).
    To obtain Kelvin (K) readings, use formula: K = (5/9)(F - 32) + 273.15.
SYMBOLS AND DEFINITIONS

F        The F-value may be produced by a multiple regression program and is
         analogous to the t-value in simple regression (one independent
         variable). The F-value indicates the "significance" of r² and is
         useful in selecting the most important independent variables.


         $$ F = \frac{\sum (\hat{y} - \bar{y})^2 \, (n - p - 1)}{\sum (y - \hat{y})^2 \; p} \qquad \text{(Model I)} $$


H_b      height of breaking waves

n        size of the sample

p        total number of independent variables. Caution: several observed
         carriers may end up combined into a single independent variable;
         e.g., X = (gH_b)^(1/2) sin 2α_b has two distinct carriers (H_b and
         α_b) but is one independent variable (see example problem 1). The
         value of p will be one less than the number of constants to be
         estimated in Model I, and is equal to the number of constants in
         Model II.

r        sample correlation coefficient. The r-value produced by regression
         partially measures the closeness of fit between the linear predictor
         and data. Its square is called the coefficient of determination.


         $$ r^2 = \frac{\sum (\hat{y} - \bar{y})^2}{\sum (y - \bar{y})^2} = \frac{\left[\sum (x - \bar{x})(y - \bar{y})\right]^2}{\sum (y - \bar{y})^2 \, \sum (x - \bar{x})^2} \qquad \text{(Model I)} $$

         $$ r^2 = \frac{\left(\sum xy\right)^2}{\sum x^2 \, \sum y^2} \qquad \text{(Model II)} $$

SS_x     sum of squares of x; may be produced by the regression program and
         is useful for computing other values, e.g., S_β.

         $$ SS_x = \sum (x - \bar{x})^2 $$

S_β      standard error of the estimated slope,

         $$ S_\beta = \frac{S_{y \cdot x}}{\sqrt{SS_x}} $$

         The larger S_β, the less reliable is the estimate of slope.

S²_y·x   unbiased estimator of the variance of the random component ε, e.g.,

         $$ S_{y \cdot x}^2 = \frac{\sum (y - \hat{y})^2}{n - p - 1} \qquad \text{in Model I} $$


         The number of independent variables, p, is 1 in simple regression
         with Model I. The mean square deviation from regression corresponds
         to the simple variance used to measure the spread of values in a
         single data set. It is also sometimes called the standard error of
         the estimate.
S_ŷ      The value produced by regression to indicate uncertainty of the
         estimated y; the value S_ŷ depends on the variances of all the
         estimated coefficients.

t        The t-value produced in simple regression to test whether the
         estimated regression coefficient is "significantly" different from
         zero.

v        longshore current velocity

X        independent variable in regression

x        observed values of X. A string of n values in simple regression; an
         n by p matrix in multiple regression

Y        dependent variable to be estimated

y        n observed values of Y

ŷ        estimated value of Y for given values of X

α        Y-intercept in a regression model

α_b      angle between the crest of the breaking wave and the shoreline

β        estimated regression coefficients in multiple regression or the
         slope of the line in simple regression

         $$ \beta = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} \qquad \text{(Model I)} $$

ε        zero-mean random component of Y assumed by both regression models

FORCING REGRESSION THROUGH A GIVEN POINT USING ANY 
FAMILIAR COMPUTATIONAL ROUTINE 


by 
Edward B. Hands 


I. INTRODUCTION TO REGRESSION 


The engineer frequently needs to estimate some response or dependent variable
Y (e.g., sand transport rate, change in shoreline position, or structural
damage), when given the magnitude of other factors, or independent variables X
(e.g., longshore wave energy flux, storm frequency, or elevation of storm
surges). A common approach is to assume a linear model,

$$ Y = \alpha + \beta X + \varepsilon \qquad \text{(Model I)} $$

then adopt the principle of least squares, and use sample data to estimate
the unknown parameters, α and β. Both β and X can be considered as strings of
numbers in the case of multiple regression with several independent
variables; ε indicates that the response is not being thought of as an exact
linear function of X. The ε represents random and unpredictable elements in
Y; therefore, ε does not appear in the prediction equation, ŷ = α + βx, where
α and β are estimates of the corresponding components in the conceptual
Model I. The assumption that ε has an expected value of zero indicates that
the "average" response is considered linear. If ε varies widely, Model I,
though conceptually correct, may have only limited predictive value. In such
a case the estimated mean value of Y would frequently be thrown off by noise
in the data. If ε varies only slightly, good predictions will be possible
provided good estimates of α and β are available. Adopting the principle of
least squares means one is willing to define the best estimates of α and β as
those that minimize the sum of the squares of the deviations between the
observed and predicted values (i.e., y and ŷ).
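The least-squares principle for Model I can be sketched in a few lines of Python. This is a minimal illustration, not a routine taken from this report: the function names and the four data points are hypothetical, and the closed-form expressions are the standard simple-regression formulas.

```python
def fit_model_1(x, y):
    """Return least-squares (alpha, beta) for the line y = alpha + beta * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta = s_xy / ss_x
    alpha = y_bar - beta * x_bar
    return alpha, beta


def sum_sq_dev(x, y, alpha, beta):
    """Sum of squared deviations between observed y and predicted y."""
    return sum((yi - (alpha + beta * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical sample, for illustration only.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]

alpha, beta = fit_model_1(x, y)
best = sum_sq_dev(x, y, alpha, beta)

# Moving either estimate away from the least-squares value can only
# increase the sum of squared deviations.
worse_intercept = sum_sq_dev(x, y, alpha + 0.1, beta)
worse_slope = sum_sq_dev(x, y, alpha, beta + 0.1)
print(alpha, beta, best, worse_intercept, worse_slope)
```

Perturbing either estimate gives a larger sum of squares, which is the sense in which the fitted line is "best."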


Customarily, no constraints are placed on the contenders for the best fit
line. Of all possible lines in the XY plane, the prediction equation is
chosen because it has the least sum of squares of deviations in y from the
data points. The y-intercept, α, is the point where the best fit line
intersects the Y-axis. The intercept may be of special interest, e.g., in the
regression of current speed against longshore wave energy flux measured in a
field test (Fig. 1). An intercept substantially above zero would suggest that
during the test a component of the longshore current was driven by mechanisms
other than waves (e.g., tides or winds). In this case, the nonzero intercept
would not only be meaningful, but would also provide a good estimate of the
velocity of any steady, nonwave-generated coastal current during the test.


An additional example of unconstrained regression would be where greater
and greater structural damage occurs as the wave forces exceed an undetermined
threshold value. Again Model I applies and produces the correct regression
coefficient (β). In the process it produces a meaningless response intercept
well below zero (Fig. 2). In contrast with the previous example, the interest
here is strictly in the prediction of future damage for given wave forces, not
in the value of the intercept itself. The resulting linear relationship
applies only to values of the independent variable above the threshold of
wave effect.


Figure 1. Application of Model I produces an intercept (α), which may be a
          useful estimate of a component of longshore flow which is
          independent of wave conditions and presumably pervades the entire
          data set. (Axis labels: Flow Rate, Wave Energy Flux.)


Figure 2. Application of Model I identified a threshold value below which
          waves cause no damage. A negative intercept is produced, but is of
          no interest in this particular problem. (Axis label: Wave Forces;
          plot annotation: Negative Intercept.)


Although the negative intercept (α) is in itself meaningless, Model I is
correct because there is no basis for constraining α.


II. A PROBLEM WITH THE CUSTOMARY APPROACH 


There are many cases where the logic of the application dictates the
response at a particular value of X. For example, if the response is some
change that is regressed against time, then the response must be 0 when X = 0
(Fig. 3). If there is no elapsed time, there can be no change. If the linear
assumption is valid, the appropriate conceptual model is

$$ Y = \beta X + \varepsilon \qquad \text{(Model II)} $$

and the customary predictive equation (based on Model I) is inappropriate and
may give poor estimates of β (see Fig. 4). Yet the vast majority of regression
programs (e.g., SPSS, IMSL, IBM's 5110 package, and TI-59) do not allow
specification of a zero intercept or any constraint through a known point.
Statistical texts usually do not cover this topic either. However, formulas
for the zero-intercept case are given by Brownlee (1965) and Krumbein (1965).


Figure 3. Application of Model II forces a zero-intercept solution.


Figure 4. Model II estimates an increase in Y per unit increase in X
          that is nearly twice that predicted using Model I (plotted lines:
          Model II, β = 0.63; Model I, β = 0.34). The physical relationship
          between X and Y dictates which model should be adopted. If Model II
          is appropriate the solution can be obtained using a simple artifice
          described in this report to modify results of standard computer
          programs intended for Model I.


The value of Y may be known for a single value of X (not necessarily 0).
The best prediction should then be sought from among the limited subset of
lines through this point. All these lines will have a sum of squares
(Σ[y - ŷ]²) at least as large as that of the line selected by Model I, and
larger unless that line happens to pass through the known point. A simple
procedure is described herein for picking from among these restricted
candidates the one with the smallest Σ[y - ŷ]². Thus, regressing through the
origin is but one specific case that can be solved by a general model forcing
regression through an arbitrary point.


III. SOLUTION TO THE PROBLEM 


This report describes a method for getting the best fit to all data points 
(in the sense of least squares) while forcing an exact fit at any known point. 
A simple procedure for forcing regression through the origin was described by 
Hawkins (1980), who indicated the procedure was not well known. The author of 
this report knows of no references to the general case of an exact fit to an 
arbitrary point. However, if a fit can be constrained through the origin, then 
a simple transform of variables can force the line through any given point. 
The details of the through-the-origin solution will be explained first. 


1. Regression Through the Origin. 


For each set of measured dependent and independent variables observed
(y_i, x_i), also enter, or program, a mirror-image set (-y_i, -x_i). Thus, the
computer is given an extended data set consisting of 2n data points, only n
of which were observed. By definition of this extended data set, the
dependent and all the independent variables each individually sum to zero,
forcing a zero intercept:

$$ \alpha = \bar{y} - \beta \bar{x} \qquad \text{by the principle of least squares} $$

$$ \alpha = 0 \qquad \text{because } \sum x = \sum y = 0 \text{ and thus } \bar{x} = \bar{y} = 0 \text{ on the extended data set} $$


Thus a zero-intercept solution is obtained. Is it still the least squares
solution for the observed data set? The principle of least squares by
definition minimizes the sum of the squares of the deviations of the observed
from the predicted values. For any line through the origin, the deviation at
a mirror-image point (-x_i, -y_i) is the negative of the deviation at the
observed point (x_i, y_i), so each squared deviation from the observed data
set generates an identical squared deviation in the extended data set. The
sum of squares over the extended data set is therefore exactly twice the sum
over the observed data set, and it is minimized only if the sum over the
observed set is also minimized. Thus, the regression coefficient produced in
this manner is not only the least squares solution for the artificially
extended data set, but for the observed data set as well. By this artifice
the proper estimate is obtained for the regression coefficient (β) with the
prediction forced through the origin.
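The mirror-image artifice can be checked numerically. The sketch below is illustrative only: the data are hypothetical, `fit_model_1` stands in for "any familiar computational routine," and the comparison value uses the standard zero-intercept formula, β = Σxy / Σx².

```python
def fit_model_1(x, y):
    """Ordinary (unconstrained) least squares for y = alpha + beta * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta = s_xy / ss_x
    return y_bar - beta * x_bar, beta


def mirror_extend(x, y):
    """Append the mirror image (-x_i, -y_i) of every observed point."""
    return list(x) + [-xi for xi in x], list(y) + [-yi for yi in y]


# Hypothetical observed data (n = 4).
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]

x_ext, y_ext = mirror_extend(x, y)          # 2n = 8 points
alpha_ext, beta_ext = fit_model_1(x_ext, y_ext)

# Zero-intercept slope computed directly from the observed data.
beta_direct = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

print(alpha_ext, beta_ext, beta_direct)
```

The unconstrained routine, fed the extended data, returns an intercept of zero (to rounding error) and the same slope as the direct zero-intercept formula.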


2. Regression Through Any Arbitrary Point (a, b). 


If the predicted response (Y) must be b when the independent variables
(X) are a, then form the shifted variables u = x - a and v = y - b and
regress v on u using the extended data set (each shifted pair together with
its mirror image). If (a, b) = (0, 0), then this collapses to the exact
situation described above. If (a, b) ≠ (0, 0), the direct result, v = βu,
should be unraveled to produce the y prediction:

$$ \hat{y} - b = \beta (x - a) $$

$$ \hat{y} = (b - a\beta) + \beta x $$

NOTE: The proper estimate of the regression coefficient (β) now forces the
prediction through the point (a, b) as desired. With this procedure the
correct regression coefficient is obtained using any familiar computational
routine. The second most frequently reported output from regression programs,
the correlation coefficient (r), is also the correct, unbiased estimator for
Model II.
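The shift-and-mirror procedure for an arbitrary point might be sketched as follows. The function name `fit_through_point`, the data, and the chosen point (1.0, 2.0) are assumptions made for illustration; the slope is computed with the ordinary unconstrained formula applied to the extended, shifted data.

```python
def fit_through_point(x, y, a, b):
    """Least-squares line forced through (x, y) = (a, b).

    Returns (intercept, slope) of y_hat = intercept + slope * x.
    """
    # Shift so that the known point becomes the origin.
    u = [xi - a for xi in x]
    v = [yi - b for yi in y]
    # Extended data set: shifted points plus their mirror images.
    u_ext = u + [-ui for ui in u]
    v_ext = v + [-vi for vi in v]
    # Ordinary unconstrained slope on the extended data.
    n = len(u_ext)
    u_bar = sum(u_ext) / n
    v_bar = sum(v_ext) / n
    slope = (sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u_ext, v_ext))
             / sum((ui - u_bar) ** 2 for ui in u_ext))
    # Unravel v = slope * u into the original variables.
    return b - a * slope, slope


# Hypothetical data and a hypothetical known point.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]
a, b = 1.0, 2.0

intercept, slope = fit_through_point(x, y, a, b)
print(intercept, slope, intercept + slope * a)  # last value equals b
```

Evaluating the fitted line at x = a returns b, confirming that the constraint holds exactly.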


If additional information is provided by the regression program, then
corrections may be necessary before adopting it for the real data set. The
estimate of the residual variance will be correct for simple regression (one
independent variable) and can be easily adjusted for multiple regression (see
Table 1). Any sums of squares, cross products, and F-values produced by the
program will be exactly twice the correct values. The standard error of the
estimated slope will be too small by a factor of √2. Therefore, the t-value,
for testing the zero slope hypothesis, will be too large by the same factor.
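These corrections can be illustrated for simple regression. In the sketch below, `model_1_output` is a hypothetical stand-in for a regression program's printout (it assumes the usual n - 2 degrees of freedom for Model I), and the data are invented; only the correction factors come from the text above.

```python
import math


def model_1_output(x, y):
    """Typical simple-regression printout for y = alpha + beta * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta = s_xy / ss_x
    alpha = y_bar - beta * x_bar
    sse = sum((yi - alpha - beta * xi) ** 2 for xi, yi in zip(x, y))
    s2 = sse / (n - 2)                 # residual variance, Model I
    s_beta = math.sqrt(s2 / ss_x)      # standard error of the slope
    return {"beta": beta, "ss_x": ss_x, "s2": s2,
            "s_beta": s_beta, "t": beta / s_beta}


# Hypothetical observed data.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]

# Printout obtained from the extended (mirrored) data set.
raw = model_1_output(x + [-xi for xi in x], y + [-yi for yi in y])

# Corrections for the real data set (simple regression, p = 1).
corrected = {
    "beta": raw["beta"],                      # already correct
    "ss_x": raw["ss_x"] / 2.0,                # sums of squares are doubled
    "s2": raw["s2"],                          # already correct when p = 1
    "s_beta": raw["s_beta"] * math.sqrt(2.0),  # too small by sqrt(2)
    "t": raw["t"] / math.sqrt(2.0),           # too large by sqrt(2)
}
print(corrected)
```

The residual variance needs no correction here because the program divides twice the true residual sum of squares by 2n - 2, which equals the true sum divided by n - 1.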


Table 1 indicates the corrections for most of the elements produced by
various regression programs. However, employing the described extended data
procedure does not require consideration of any part of the output beyond that
used in the standard unconstrained approach.


IV. SELECTING BETWEEN MODELS I AND II 


If either the true or mean value (whichever interpretation fits the
situation) of the dependent variable (Y) is unknown for all values of the
independent variable in the range of concern, then the customary model (I)
may be appropriate. However, if the postulated physical relationship between
X and Y dictates constraint through any point (a, b) and the relationship is
linear from the maximum observed x to x = a, then Model II should be used. To
proceed with the customary evaluation of Model I would be equivalent to
ignoring what is already known about the relationship between X and Y and,
instead, relying totally on the limited information available in the sample
data. The objective should be to obtain the best interpretation of the data,
which does not override any more firmly established understanding of the
situation.


Assuming Model II applies, it may still be useful to evaluate Model I to
test in the conventional way (Draper and Smith, 1966) the significance of the
estimated nonzero intercept. If this test fails to provide enough evidence to
reject the strawman hypothesis (H_0: α = 0), then this failure may be cited as
additional evidence, strictly from the data, substantiating the choice of
Model II to estimate β. The results of this formal test of hypothesis should
not, however, be relied on as the criterion for selecting Model II. It should
serve only as a source of auxiliary information clarifying the extent to
which the sample data will support the model choice. The choice should be
made on the basis of functional insight and understanding of the relationship
between X and Y.
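The conventional intercept test referred to above might be sketched as follows. The function name and data are hypothetical, and the standard-error expression is the usual textbook formula for simple regression rather than one quoted from this report; the resulting t-value would be compared with a tabulated Student's t for n - 2 degrees of freedom.

```python
import math


def intercept_t_value(x, y):
    """t-value for H0: alpha = 0 in the Model I fit y = alpha + beta * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta = s_xy / ss_x
    alpha = y_bar - beta * x_bar
    sse = sum((yi - alpha - beta * xi) ** 2 for xi, yi in zip(x, y))
    s2 = sse / (n - 2)                                   # residual variance
    se_alpha = math.sqrt(s2 * (1.0 / n + x_bar ** 2 / ss_x))
    return alpha, se_alpha, alpha / se_alpha


# Hypothetical observed data.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]

alpha, se_alpha, t_alpha = intercept_t_value(x, y)
print(alpha, se_alpha, t_alpha)
```

A small t-value, as in this invented sample, means the data alone give no grounds for rejecting a zero intercept; as stated above, that is supporting information only, not the criterion for choosing Model II.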


Comparing the correlation coefficients or r-values, produced using the
real data and the extended data, is likewise not a valid method for choosing
between Models I and II. The value of r² using Model I (observed data only)
is often referred to as the reduction in variance of the estimator made
possible by using the apparent association between X and Y. A value of r² = 0
indicates that knowledge of the X-values makes no improvement in the
prediction of Y and using the mean value of the y's as the estimator would not
increase the sum of the squares of the deviations. At the other extreme, if
r² = 1, all sample points lie on a sloping straight line, implying a strong
predictive value. Similarly with Model II, higher r² values indicate improved
fit of the data; but comparing r² values between Models I and II does not
reveal which is correct or even preferable. There is a slight conceptual and a
substantial computational difference between the r² values for the two models.
The two values should not be compared; both indicate the relative fit of
various data to their own particular model. Either value can be used to
measure "goodness of fit" in particular applications, or even to indicate the
usefulness of several versions of the particular model chosen. For example,
comparison of r-values would indicate whether taking logs of the measurements,
or raising them to a given power prior to regression, improved the fit. But
comparison of the r-values would not be a valid basis for choosing between
Models I and II.
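The computational difference between the two r² values can be made concrete. In this sketch (hypothetical data and function names), the Model I value is computed from deviations about the means, while the Model II value is what the same formula yields on the mirror-extended data, where both means are zero.

```python
def r2_model_1(x, y):
    """Coefficient of determination about the means (Model I)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    ss_y = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy ** 2 / (ss_x * ss_y)


def r2_model_2(x, y):
    """Coefficient of determination about the origin (Model II)."""
    s_xy = sum(xi * yi for xi, yi in zip(x, y))
    return s_xy ** 2 / (sum(xi * xi for xi in x) * sum(yi * yi for yi in y))


# Hypothetical observed data.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]

r2_I = r2_model_1(x, y)
r2_II = r2_model_2(x, y)
# The Model I formula applied to the extended data gives the Model II value.
r2_ext = r2_model_1(x + [-xi for xi in x], y + [-yi for yi in y])
print(r2_I, r2_II, r2_ext)
```

The two numbers differ because they measure scatter about different references (the sample means versus the origin), which is why they should not be compared across models.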


V. EXAMPLES 


The following problems illustrate a frequent need to constrain the regression
line in coastal engineering applications. The problems also illustrate the
usefulness of r² to rank different predictors in terms of how well they fit
data. Before initially applying the described method to an actual problem, it
may be helpful to reanalyze one of the small data sets used in these examples
and compare the results with those published in this report.


* * * * * * * * * * * * * * * EXAMPLE PROBLEM 1 * * * * * * * * * * * * * * *


Consider the requirement to simulate a long-term history of wave-induced
longshore currents for a particular coastal site. Assume hindcasted wave data
are available, but that current measurements were not made over the period of
interest. According to the Shore Protection Manual (U.S. Army, Corps of
Engineers, Coastal Engineering Research Center, 1977), the longshore current
(v) can be calculated as a function of the beach slope (m), the gravitational
acceleration (g), and the angle and height of breaking waves (α_b and H_b,
respectively):

$$ v = 20.7 \, m \, (g H_b)^{1/2} \sin 2\alpha_b \qquad (1) $$

The coefficient of proportionality (20.7) is based on typical mixing and
frictional factors for the surf zone. Empirical formulas, like equation (1),
can be adjusted by regression analysis of test data from the specific site of
intended application. This will customize the formula to fit site-sensitive
conditions. The longshore velocity also varies laterally within the surf zone.
The problem of estimating the spatial structure of flow across the surf zone
may be avoided by obtaining current measurements at the exact point where the
long-term flow must be reconstructed, then regressing the test measurements
against simultaneously determined breaker conditions. Steps in such an
analysis are given below. Only a few data points are used in the example to
encourage the reader to go through the computations and check the results.
The data are taken from a frequently referenced field study done at Nags
Head, North Carolina (Galvin and Savage, 1966).


GIVEN: Longshore current velocities (v), breaker heights (H_b), breaker
angles (α_b), and the beach slope (m) determined onsite during a short
field evaluation (see Table 2).


Table 2. Field calibration data (from
         Galvin and Savage, 1966).

Obsn.    H_b      m        v
         (ft)              (ft/s)
1        2        0.03     2.42
2        3.2      0.026    4.33
3        1.8      0.029    1.96
4        8        0.026    1.26


REQUIRED: An equation that will predict wave-induced longshore currents for 
the test site. 


ANALYSIS: Because the linearity expressed in equation (1) has a firm
theoretical basis in the concept of radiation stress (Longuet-Higgins, 1970),
and because according to this concept, v = 0 whenever H_b = 0 or α_b = 0, the
prediction line must pass through the origin (0, 0). So Model II must be used.


Let

$$ Y = v $$

and

$$ X = m \, (g H_b)^{1/2} \sin 2\alpha_b $$

Regress Y on X to determine the best estimate of the coefficient of
proportionality between X and Y.


CORRECT RESULTS: 


Regression coefficient          β = 17
Correlation coefficient         r = 0.91
Standard error of β             S_β = 4.6
Test statistic for β            t = 3.67
Estimated residual variance     S²_y·x = 1.8

CONCLUSION: The version of the Longuet-Higgins type equation that best fits
this problem site (based on available current data) is:

$$ v = 17 \, m \, (g H_b)^{1/2} \sin 2\alpha_b $$

NOTE: Fitting the equation to the data in this example produces results closer
to those obtained with larger data sets (eq. 1) if the line is forced through
the origin rather than being fit strictly to the data without this constraint
(see Fig. 5).
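Readers who wish to check example problem 1 by machine could proceed along the following lines. This is a sketch, not the computation used for this report: the X and Y values are the observed (positive) rows of Table 3 as printed, the variable names are arbitrary, and the program output is imitated with the ordinary Model I formulas applied to the extended data.

```python
import math

# Observed rows of Table 3: X = m (g H_b)^(1/2) sin 2 alpha_b, Y = v, in ft/s.
x_obs = [0.152, 0.162, 0.0827, 0.170]
y_obs = [2.42, 4.33, 1.96, 1.27]

# Extended data set: each observed point plus its mirror image.
x_ext = x_obs + [-xi for xi in x_obs]
y_ext = y_obs + [-yi for yi in y_obs]

# What an ordinary (Model I) simple-regression program would compute.
n_ext = len(x_ext)
x_bar = sum(x_ext) / n_ext
y_bar = sum(y_ext) / n_ext
ss_x = sum((xi - x_bar) ** 2 for xi in x_ext)
ss_y = sum((yi - y_bar) ** 2 for yi in y_ext)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x_ext, y_ext))
beta = s_xy / ss_x
alpha = y_bar - beta * x_bar
r = s_xy / math.sqrt(ss_x * ss_y)
sse = sum((yi - alpha - beta * xi) ** 2 for xi, yi in zip(x_ext, y_ext))
s2 = sse / (n_ext - 2)                  # correct as printed when p = 1
s_beta_raw = math.sqrt(s2 / ss_x)

# Corrections for the real data set.
s_beta = s_beta_raw * math.sqrt(2.0)    # program value is too small by sqrt(2)
t = beta / s_beta

print(round(beta, 1), round(r, 2), round(s_beta, 1), round(t, 1), round(s2, 1))
```

With the table values as printed, the slope, standard error, and residual variance round to the listed results (17, 4.6, and 1.8); r and t come out within about 0.01 of the listed values, a difference presumably due to rounding in the tabulated X values.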


Figure 5. Real test data for example problem 1. Compare the correct fit
          through the origin with the customary fit. (Axes: Y = measured
          velocity; X = m(gH_b)^(1/2) sin 2α_b.)


* * * * * * * * * * * * * * * EXAMPLE PROBLEM 2 * * * * * * * * * * * * * * *


At least 10 equations relating the velocity of longshore currents to wave
characteristics have appeared in the literature. Presumably more will appear
as knowledge increases or theory is adapted to specific wave or bathymetric
conditions (i.e., specialized for breaker type or bar dimensions). A recent
article (Komar, 1979) questions the value of including a measure of beach slope
in the general prediction equation and claims better results for

$$ v = 0.585 \, (g H_b)^{1/2} \sin 2\alpha_b $$

GIVEN: The same situation and data as in example problem 1.

REQUIRED: Determine the best fit version of the type

$$ v = \beta \, (g H_b)^{1/2} \sin 2\alpha_b $$

and compare the results with those obtained in example problem 1 to see if
the beach slope is indeed of any value at this particular site.


ANALYSIS: For the same reasons stated in example problem 1, regression
should require the prediction line to pass through the point (0, 0).

Let

$$ Y = v $$

$$ X = (g H_b)^{1/2} \sin 2\alpha_b $$

and regress Y on X using Model II with its extended data set (Fig. 6).


Figure 6. Real test data for example problem 2 and fitted equations. Compare
          the correct fit through the origin with the customary fit. (Axes:
          Y = measured velocity; X = (gH_b)^(1/2) sin 2α_b.)


CORRECT RESULTS: 


Regression coefficient          β = 0.46
Correlation coefficient         r = 0.90
Standard error of β             S_β = 0.13
Test statistic for β            t = 3.6
Estimated residual variance     S²_y·x = 1.9


CONCLUSION: The best predictor of the Komar type is:

$$ v = 0.46 \, (g H_b)^{1/2} \sin 2\alpha_b $$


It would be surprising to find a clear indication of whether beach slope should 
be included in the predictor for longshore currents by evaluating such a 
limited data set as chosen here to encourage reader computation. Indeed a 
comparison of Tables 3 and 4 reveals no significant differences between the 
correlation coefficients or any other test statistics. However, significant 
differences would be expected if a large reliable data set covering a wider 
range of conditions were compared by the methods illustrated in this report. 
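The comparison of the two predictors can be reproduced from the observed rows of Tables 3 and 4. The sketch below uses the direct zero-intercept formulas, which give the same β and r as the extended data procedure; the function name `fit_model_2` and the variable names are hypothetical.

```python
def fit_model_2(x, y):
    """Zero-intercept least-squares slope and r for y = beta * x."""
    s_xy = sum(xi * yi for xi, yi in zip(x, y))
    s_xx = sum(xi * xi for xi in x)
    s_yy = sum(yi * yi for yi in y)
    return s_xy / s_xx, s_xy / (s_xx * s_yy) ** 0.5


# Observed (positive) rows of Tables 3 and 4, in ft/s.
y_obs = [2.42, 4.33, 1.96, 1.27]
x_with_slope = [0.152, 0.162, 0.0827, 0.170]   # Table 3: m (g H_b)^(1/2) sin 2 alpha_b
x_without_slope = [5.05, 6.25, 2.85, 6.53]     # Table 4: (g H_b)^(1/2) sin 2 alpha_b

beta_1, r_1 = fit_model_2(x_with_slope, y_obs)
beta_2, r_2 = fit_model_2(x_without_slope, y_obs)

print(round(beta_1, 1), round(r_1, 2))   # predictor including beach slope
print(round(beta_2, 2), round(r_2, 2))   # predictor without beach slope
```

Both predictors are fitted under Model II, so their r-values may legitimately be compared; for this small sample the two are nearly equal, in line with the conclusion above.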


Table 3. Extended data set No. 1.

Obsn.       X          Y
         (ft/s)     (ft/s)
1         0.152       2.42
         -0.152      -2.42
2         0.162       4.33
         -0.162      -4.33
3         0.0827      1.96
         -0.0827     -1.96
4         0.170       1.27
         -0.170      -1.27

Table 4. Extended data set No. 2.

Obsn.       X          Y
         (ft/s)     (ft/s)
1         5.05        2.42
         -5.05       -2.42
2         6.25        4.33
         -6.25       -4.33
3         2.85        1.96
         -2.85       -1.96
4         6.53        1.27
         -6.53       -1.27